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Abstract 

This paper studies the outlier detection problem from the point of view of 
penalized regressions. The regression model adds one mean shift parameter for 
each of the n data points. We then apply a regularization favoring a sparse 
vector of mean shift parameters. The usual Li penalty yields a convex crite- 
rion, but we find that it fails to deliver a robust estimator. The Li penalty 
corresponds to soft thresholding. We introduce a thresholding (denoted by 0) 
based iterative procedure for outlier detection (B-IPOD). A version based on 
hard thresholding correctly identifies outliers on some hard test problems. We 
describe the connection between G-IPOD and M-estimators. Our proposed 
method has one tuning parameter with which to both identify outliers and es- 
timate regression coefficients. A data-dependent choice can be made based on 



BIC. The tuned 0-IPOD shows outstanding performance in identifying outhers 
in various situations in comparison to other existing approaches. In addition, 
we find that 0-IPOD is much faster than iteratively reweighted least squares 
for large data because each iteration costs at most 0{np) (and sometimes much 
less) avoiding an 0{np'^) least squares estimate. This methodology extends to 
high-dimensional modeling with p ^ n, if both the coefficient vector and the 
outlier pattern are sparse. 

Keywords: M-estimation, sparsity, robust regression, thresholding 

1 Introduction 

Outliers are a pervasive problem in statistical data analysis. Nonrigorously, out- 
liers refer to one or more observations that are different from the bulk of the data. 
Hampel et al. (1986) estimate that a routine data set may contain about 1-10% (or 
more) outliers. Unfortunately, outliers often go unnoticed (Rousseeuw & Leroy 1987), 
although they may have serious effects in estimation, inference, and model selec- 
tion (Weisberg 1985). Perhaps the most popular statistical modeling method is or- 
dinary least squares (OLS) regression. OLS is very sensitive to outliers — a single 
unusual observation may break it down completely. Our goal in this work is outlier 
identification for regression models, together with robust coefficient estimation. 

We consider the linear regression model given by j/ = X/3 + e where X G M"^^ 
is a fixed matrix of predictors, /3 G is a fixed (unknown) coefficient vector and 
e G is a random error vector. The i'th case is written yi = xj f3 + e^. 

Suspected outliers are most commonly found by looking at residuals = yi — xj^ 
where ^ is the OLS estimate of (3. It is well-known that such raw residuals can fail 
to detect outliers at leverage points. A better way to detect an outlier is the leave- 
one-out approach (Weisberg 1985). If the i'th case is suspected to be an outlier, then 
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we compute the externally studentized residual 

t. = ~ ^^^w fii) 

' a(.)(l + =r7(Xj)X(,))-ia;,)V2 

where ^(j), and (3"(j) are the predictor matrix, coefficient estimate and scale 
estimate respectively, based on n — 1 observations, leaving out the z'th. Large values 
\ti\ > T] are then taken to suggest that observation i is an outlier. The threshold 
r] = 2.5 (Rousseeuw & Leroy 1987) is a reasonable choice. If e ~ A/'(0, a^/) then 
ti ~ t(„-p_i) and we can even attach a significance level to \ti\. After removing an 
apparent outlier from the data, one then looks for others. 

Studentized residuals, and other leave-one-out methods such as Cook's distance 
and DFFITS, are simple and effective when there is only one outlier. When there are 
multiple outliers, these simple methods can fail. Two phenomena have been remarked 
on. In masking, when an outlying i'th case has been left out, the remaining outliers 
cause either a large value of (T(j) or a small value of \yi — a;7/3(j)|, or both, and as a 
result observation i does not look like an outlier. Therefore, multiple outliers may 
mask each other and go undetected. In swamping, the effect of outliers is to make 
lUi — a;7/3(i)| large for a non-outlying case i. Swamping could lead one to delete good 
observations and becomes more serious in the presence of multiple outliers. 

In this paper we take the studentized residual as our starting point. The t-test for 
whether observation i' is an outlier is the same as testing whether the parameter 7 is 
zero in the regression y = X f3+'-fli=i' + e. Because we don't know which observations 
might be outliers, we use a model 

t/ = X/3 + 7 + e, e^M{0,a^I), (1.2) 

in which the parameter 7^ is nonzero when observation i is an outlier. This formulation 
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was earlier used by Gannaz (2006) and McCann & Welsch (2007). This mean-shift 
model allows any combination of observations to be outliers. It has n + p regression 
parameters and only n data points. Our approach is to fit (1.2) imposing sparsity 
on 7 in order to avoid the trivial estimate 7 = 1/ to get a meaningful estimate 
of f3. The resulting algorithm is called thresholding (denoted by 0) based iterative 
procedure for outlier detection, or 0-IPOD for short. 

All of our proposals (apart from one exception noted where it arises) require a 
preliminary robust regression to be run. This practice is in line with the best current 
robust regression methods. The preliminary regression supplies a robust estimate of 
(3, and usually a robust estimate of a as well. The robust regression methods that 
we compare to are known to outperform the preliminary regressions that they use as 
starting points. We will compare 0-lPOD to those best performing methods. 

The rest of the paper is organized as follows. Section 2 surveys the literature 
on robust regression. Section 3 develops the soft-IPOD algorithm which fits (1.2) 
using an Li penalty on 7. This algorithm minimizes a convex criterion, but it is not 
robust. Section 4 develops a family of algorithms replacing soft-thresholding by a 
general thresholding rule G. We find that some nonconvex criteria properly identify 
multiple outliers in some standard challenging test cases. Section 5 investigates the 
computational efficiency of 0-lPOD in comparison to iteratively reweighted least 
squares (IRLS). IRLS requires a QR decomposition at each iteration, while 6-IPOD 
needs only one. As a result, though possibly requiring more iterations, it is much 
faster in large problems that we investigate. In Section 6, we discuss the important 
problem of parameter tuning in outlier detection and carry out an empirical study 
to demonstrate the advantage of our penalized approach. Section 7 extends the 
technique to high-dimensional data with p > n. Our conclusions are in Section 8. 
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2 Survey of robust regression 

Many methods have been developed for multiple outlier identification. Robust or re- 
sistant regressions, such as M-estimators (Huber 1981) and Least Trimmed Squares 
(LTS) (Rousseeuw & Leroy 1987), were proposed to provide a trustworthy coefficient 
estimate even in the presence of multiple outliers. There are also clustering-based pro- 
cedures, such as Serbert et al. (1998), and some informal methods based on graphics. 
Multiple outlier detection procedures usually alternate between two steps. One step 
scans regression output (coefficient and variance estimates) to identify seemingly clean 
observations. The other fits a linear regression model to those clean observations. The 
algorithm can be initialized with OLS, but generally it is better to initialize it with 
something more robust. 

If a data set contains more than one outlier, masking may occur and the task 
of outlier detection is much more challenging. Well-known examples of masking in- 
clude the Belgian telephone data and the Hertzsprung-Russell star data (Rousseeuw 
& Leroy 1987), as well as some artificial datasets like the Hawkins-Bradu-Kass (HBK) 
data (Hawkins, Bradu & Kass 1984) and the Hadi-Simonoff (US) data (Hadi & 
Simonoff 1993). The HS and HBK data sets both have swamping effects. 

The main challenge in multiple outlier detection is to counter masking and swamp- 
ing effects. Hadi & Simonoff (1993) describe two broad classes of algorithms — direct 
methods and indirect methods. The direct procedures include forward search algo- 
rithms (Hadi &: Simonoff 1997; Pena & Yohai 1995; Atkinson & Riani 2000) and back- 
ward selection (Menjoge & Welsch 2010) among others. Indirect methods are those 
that use residuals from a robust regression estimate to identify the outliers. Exam- 
ples of indirect methods include Least Median of Squares (LMS) (Rousseeuw 1984), 
Least Trimmed Squares (LTS) (Rousseeuw & Leroy 1987), S-estimators (Rousseeuw 
& Yohai 1984), MM-estimators (Yohai 1987), one-step GM estimators (Simpson 
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et al. 1992) and SIS estimators (Coakley & Hettmansperger 1993). Almost all start 
with an initial high breakdown point estimate (not necessarily efficient) say from 
ITS, S, MTS (Nguyen & Welsch 2010), or Pefia & Yohai's (1999) (denoted by PY) 
fast procedure. Most published examples in the outlier detection literature are in 10 
or fewer dimensions. Methods with high breakdown point typically have costs that 
grow exponentially in the dimension. In practice, when there are a large number 
of predictors, PY can be applied to provide an initial estimate with certain robust- 
ness, although in theory its breakdown point property is not well established (Pena 
& Yohai 1999). 

It is worth mentioning that outlier identification and robust regression are two 
closely related but not quite identical problems (Rousseeuw & Leroy 1987; Maronna 
et al. 2006). Even if we perfectly identified the regression coefficient vector, there 
could still be some overlap between the residual distributions for good and outlying 
points. That is, given the true regression coefficient (3 there would be type one and 
type two errors in trying to identify outliers. On the other hand, if we could perfectly 
identify all gross outliers, it is a relatively easy task to obtain a robust coefficient 
estimate, as will be supported by our experiments in Section 6. 

3 Soft-IPOD 

We will use the mean shift model (1.2) from the introduction which predicts y by 
the usual linear model X(3 plus an outlier term 7. If 7^ = then the z'th case is 
good, and otherwise it is an outlier. Our goals are to find a robust estimate of (3 as 
well as to estimate 7 thereby identifying which cases are outliers and which are not. 
We assume that 7 is sparse because outliers should not be the norm. We suppose 
at ffist that n > p and that X = [xi, . . . , Xn]^ has full rank p. Section 7 considers 
the case when /3 is sparse, too, with p possibly greater than n. Yet the majority of 
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our paper focuses on the outlier problem only. Let H be the hat matrix defined by 
H = H{X) = X{X'^X)-^X'^. The i'th diagonal entry of if, denoted hi, is called 
the leverage of the i'th observation. 

The assumed sparsity of 7 motivates using an Li-penalized regression to minimize 

1 " 
fsoM 7; A) = -\\y-Xf3- + J2 ^il7.| (3.1) 

i=l 

over f3 and 7. The choice of A G [0, 00)" is discussed below. The function /soft in 
(3.1) is jointly convex in f3 and 7. Its simple form suggests that an alternating opti- 
mization can be applied: given 7, the optimal (3 is the OLS estimate from regression 
of 7/ — 7 on X; given (3, this Li-penalized problem is orthogonal and separable in 7, 
and the optimal 7 can be obtained by soft-thresholding. This alternation between 
updates to f3 and 7 closely matches the interplay between detecting outliers and fit- 
ting regression parameters mentioned in Section 2. We call the resulting algorithm 
soft-IPOD because it is based on soft thresholding. 

We can derive a principled choice of A by comparing to Donoho & Johnstone's 
(1994) classical work which states that A(?7,) = ay/OAogn is minimax optimal when a 
is known and the predictors are orthogonal so that the residuals are uncorrelated. 
Our residuals are correlated. For example, at the first step r = y — Xf3^^^ ~ 
N'{H^, (t'^{I — H)). This motivates taking Aj = crA/2(l — hi) logn and then Sidak's 
inequality (Sidak 1967) controls the probability of wrongly declaring 7j 7^ 0. It is con- 
servative. If there are many outliers, then a is not easily found. As a result, we prefer 
to set Xi = X\/l — hi and tune the regular izat ion parameter A in a data-dependent 
way. See Sections 4 and 6 for details. 

To initialize the algorithm we must specify either f3 or 7. Because /soft is convex 
in 7) the starting point is not crucial. We could use 7 = for example. (For 
a detailed discussion of the initial estimate in the general situation, see Section 6 
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and Section 8.) A pseudo-code outline for soft-IPOD is deferred to Section 4. Soft- 
IPOD corresponds to Algorithm 1 (G-IPOD) there, with a soft thresholding rule G. 
Soft-IPOD is guaranteed to converge, even though it is not contractive. See Section 
4.2. There are other alternatives in computation for solving the convex optimization 
problem or its variants, such as the LARS (Efron et al. 2004) used in McCann & 
Welsch (2007). However, we shall see the Soft-IPOD offers a great advantage in 
generalizing the methodology to nonconvex penalized least-squares. 

Although (3.1) is a well-formulated model and soft-IPOD is computationally effi- 
cient, in the presence of multiple outliers with moderate or high leverage values, this 
method fails to remove masking and swamping effects. Take the artificial HBK data 
as an illustration. Using a robust estimate & from LTS, and Aj = 6'a/2(1 — hi) logn, 
we obtained 7 = [0, ■ ■ ■ ,0, —8.6, —9.7, —7.6, —8.4, 0, ■ ■ ■ , 0]^ which identifies cases 11- 
14 as serious outliers, while the true 7 is [10, ■ ■ ■ , 10, 0, ■ ■ ■ , 0]""" with cases 1-10 being 
the actual outliers. This erroneous identification is not a matter of parameter tuning. 
Figure 1 plots the soft-IPOD solution over the relevant range for A. Whatever value 
of A we choose, cases 11-14 are sure to be swamped, and cases 1-10 are very likely 
to be masked. The approach is not able to identify the correct outliers without 
swamping. Our extensive experience shows that this Li technique hardly works for 
any benchmark dataset in the outlier detection literature. According to Rousseeuw 
& Leroy (1987), a convex criterion is inherently incompatible with robustness. 

Next we delineate the parallels between soft-IPOD and Ruber's M-estimate regres- 
sion which is similarly non-robust. As pointed out by Gannaz (2006) and Antoniadis 
(2007), there is a connection between the Li-penalized regression (3.1) and Ruber's 
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Soft-IPOD Path 
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c (= Il7lll/I|r„,,||i) 



Figure 1: The solution path of the Li penalized regression on the HBK data. The estimates 
7j(A) (i = 1, 2, • • • , 75) are plotted against the constraint value c, given by ||7||i/||'"ois||i (the 
dual parameter of A), where rois is the OLS residual vector. The top 10 paths correspond 
to cases 1-10, the true outliers. The bottom four correspond to cases 11-14 which are all 
good leverage points, but possibly swamped. {Figures are available in color online.) 

M-estimate. Ruber's loss function is 

{A|t|-AV2, if|t|>A 
tV2, if \t\ < A. 

Ruber's method with concomitant scale estimation, minimizes 



l{(3,a)^J2P (^^^; A ) a + ^-cna (3.2) 
i=i ^ ^ ' 

over /3 and a jointly, where c > and A > are given constants. We can prove 
the following result, which is slightly more general than the version given by Gannaz 
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(2006) and Antoniadis (2007). 

Proposition 3.1. For any c > and A > suppose that we minimize 

1 2 1 

9{/3,l,(r) = — ||2/-X/3-7||2 + -cn(T + A||7||i, (3.3) 

over (3, 7 and a. Then the minimizing f3 and o match those from minimizing Ru- 
ber's criterion (3.2). For any fixed a > minimizing (3.3) over f5 and 7 yields the 
minimizer of (3.2) over (3. 

This connection is lielpful to understand tlie inherent difficulty with Li-penahzed 
regression described earlier. It is well known that Ruber's method cannot even handle 
moderate leverage points well (Huber 1981, p. 192) and is prone to masking and 
swamping in outlier detection. Its break-down point is 0. 

A promising way to improve (3.1) is to adopt a different penalty function, possibly 
nonconvex. Our launching point in this paper is, however, the operator 6 in the above 
iterative algorithm. Substituting an appropriate 6 for soft-thresholding in the IPOD 
(see Algorithm 1), we may obtain a good estimator of 7 (and /3 as well). 

4 e-IPOD 

To deal with masking and swamping in the presence of multiple outliers, we consider 
the iterative procedure described in Section 3 using a general 9 operator, referred to as 
e-IPOD. A G-IPOD estimate is a limit point of (/3^^\ 7^^)), denoted by 0, 7). Some- 
what surprisingly, simply replacing soft-thresholding by hard-thresholding (hence- 
forth hard-IPOD) resolves the masking and swamping problem, for the challenging 
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HBK example problem. After picking a from LTS and using a zero start we obtained 



7i:io= 9.7 10.2 10.4 9.7 10.1 10.0 10.8 10.4 9.8 10.1 




and 7ii:75 = which perfectly detects the true outliers. The coefficient estimate (3 
from hard-IPOD directly gives the OLS estimate computed from the clean observa- 
tions. 

Some questions naturally arise: What optimization problem is 0-lPOD trying to 
solve? For an arbitrary 0, does 0-lPOD converge at all? Or, under what conditions 
does 0-IPOD converge? To answer these questions, we limit our discussions of O to 
thresholding rules. 

4.1 IPOD, Penalized Regressions, and M-estimators 

We begin by defining the class of threshold functions we will study. It includes well- 
known thresholds such as soft and hard thresholding, SCAD, and Tukey's bisquare. 

Definition 4.1 (Threshold function). A threshold function is a real valued function 
6(t; A) defined for — oo < t < oo with A as the parameter (0 < A < oo) such that 

1) e(-t;A) = -e(t;A), 

2) e(t; A) < 0(t'; A) for t < t', 

3) limt_^oo 0(^; A) = oo, and 

4) < e(t; A) < t for < t < oo. 

In words, 9(-; A) is an odd monotone unbounded shrinkage rule for t, at any A. 

A vector version of 6 is defined componentwise if either t or A are replaced by 
vectors. When both t and A are vectors, we assume they have the same dimension. 

Ruber's soft-thresholding rule corresponds to an absolute error, or Li penalty. 
More generally, for any thresholding rule 6(-;A), a corresponding penalty function 
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Algorithm 1 0-IPOD 



given X e R"''^, y eW, e > 0, \e [0, cx))", robust pilot estimate /3^°^ G W 
j ^ 0, 7(-'^ ^ y - Xj3, converged ^ FALSE 
(Q,i?) ^ QRdecomp(X) 

// X = R e Wf^'P is upper triangular, Q e W^'p and Q'^Q = Ip 
while not converged do 

f3^^^ ^ R-^Q'^y^'^i 

rU) ^y- X(3^^^ 

yi+i) ^ e(r(j');A) 

converged ^ ||7*-"'~'""'"^ — 7*'"'''||oo < £ 

J ^ J + 1 
end while 

deliver ^ = f3'^^~^\ 7 = 7(^') 



P = Pq can be defined. There may be multiple penalty functions for a given threshold 
as demonstrated by hard thresholding (4.7) below. The following three-step construc- 
tion finds the penalty with the smallest curvature (Antoniadis 2007; She 2009): 

Q-^{u- A) = sup{t : e(t; A) < m}, 

s{u]\) = Q~^{u]\) — u, and (4.I) 

r\e\ 

P{e;\) = P@{9;\) = s{u;\)du, 

Jo 

where m > holds throughout (4.1). The constructed penalty Pq is nonnegative and 
is continuous in 9. 

Theorem 4.1. Let Q be a thresholding rule as given by Definition 4-1 o-nd let Pq 
be the corresponding penalty defined in (4.1). Given Aj > 0, the objective function is 
defined by 

1 " 
/p(/3, -f)^-\\y-Xf3-j\\l + Y, ^*)' (4.2) 

1=1 

where P(-; ■) is any function satisfying P{9] A) — P(0; A) = Pe{,9] A) + q{9] A) where 
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A) is nonnegative and q{Q{6; A)) = for all 9. Then the Q-IPOD iteration se- 
quence satisfies 

/p(/30),70)) > /p(/3(^),7^^'+'^) > fpi^^'-^'ll^^-"'^), for any j > 0. (4.3) 

The function q will often be zero, but we use non-zero q below to demonstrate 
that multiple penalties (infinitely many, as a matter of fact) yield hard thresholding, 
including the Lo-penalty. Theorem 4.1 shows that B-IPOD converges. Any limit 
point of {(3^-'\^^^^) must be a stationary point of (4.2). 0-IPOD also gives a general 
connection between penalized regression (4.2) and M-estimators. Recall that an M- 
estimator is defined to be a solution to the score equation 

j2^(yi^^;x)x. = 0, (4.4) 
j=i ^ ^ ^ 

where A is a general parameter of the ip function. Although f3 and a can be simulta- 
neously estimated by Ruber's Proposal 2 (Huber 1981), a more common practice is 
to fix a at an initial robust estimate and then optimize over f3 (Hampel et al. 1986). 
Unless otherwise specified, we consider equation (4.4) as constraining f3 with a fixed. 

Proposition 4.1. For any thresholding rule 0(-; A), for any Q-IPOD estimate {f3, 7), 
f3 is an M-estimate associated with ip, as long as satisfies 

Q{t; X) + ipit; X) = t, \/t. (4.5) 

We have mentioned that Ruber's method or soft-IPOD behaves poorly in outlier 
detection. The problem is that it never rejects gross outliers that have moderate 
or high leverage. To reject gross outliers, redescending ?/;-functions are advocated, 
corresponding to a class of thresholdings offering little shrinkage for large components. 
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or using nonconvex penalties for solving the sparsity problem (4.2). The differences 
between 0-IPOD and the corresponding M-estimator follows: 

(i) M-estimators focus on robust estimation of f3. For an explicit sparse 7 esti- 
mate, a cutoff value is usually needed for the residuals. Minimizing fp from 
equation (4.2) directly yields a sparse 7 for outlier detection and a robust /3. 

(ii) Instead of designing a robust loss function in M-estimators, (4.2) considers 
a penalty function; A is not a criterion (loss) parameter but a regularization 
parameter that we will tune in a data-dependent way. 

Figure 2 illustrates some of the better known threshold functions along with their 
corresponding penalties, ?/;-functions and loss functions. In this article, we make use 
of the following formulas 



0, if < A 

Qsoit{x;X) = { (4.6) 

X — sgn(x)A, if |x| > A, 
0, if |x| < A 

ehard(x;A) = { (4.7) 
X, if |x| > A, 



for soft and hard thresholding, respectively. The usual penalty for hard thresholding is 
P(x; A) = 1|x.|<a(A|x| — a;^/2) + 1|j,|>aA^/2. Theorem 4.1 justifies use of the Lo-penalty 
P{x; A) = AV2 ■ 4^0 using 



q{x] A) 



ifO<|x|<A 
0, if X = or |x| > A. 
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Figure 2: The ip functions (top row), loss functions (second row), thresholding functions 
(third row), and penalty functions (bottom row) for some well known thresholding rules. 
SCAD-thresholding (Antoniadis Sz Fan 2001) is a special case of the Hampel rule. 

4.2 e-IPOD and TISP 

Algorithm 1 can be simplified. The 0-IPOD iteration can be carried out without 
recomputing f3^-'^ at each iteration. We need only update 7 via 

= Q{Hj^^^ + H)y- A), (4.8) 

at each iteration, where Aj = \\/\ — hi. The multiplication {I — H)y in (4.8) can also 
be precomputed. After getting the final 7, we can estimate f3 by OLS. The resulting 
simplified B-IPOD algorithm is given in Algorithm 2. For any given thresholding rule 
e, let /p(7) = |||(/ - H){y - 'y)\\l + Yl'Li ^(7i; -^i), where P can be any penalty 
function satisfying the conditions in Theorem 4.1. Then fp{'y^-^~^^^) < fpij^-'^) for the 
0-lPOD iterates -f'^^\ j > 0. 
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Algorithm 2 Simplified 0-lPOD 

given X G R"''^, j/ G M", A > 0, e > 0, 7^°) G W\ and threshold e(-; ■) 
j ^ 0, 7^^') ^ 7^°^ converged ^ FALSE 
H i- HatMatrix(X), r - Hy 
while not converged do 

^(j+i) ^ e(/f7(^') + r; A) 

converged ^ ||7*-''"^"'"'' — 7^-'''||oo < £ 

J ^ J + 1 
end while 

deliver 7 = 7(j) , ^ = OLScoef (X, y - 7) 
X must have full rank. 7*^"^ can he y — Xj3^^^ for a robust pilot estimate 

Setting up for Algorithm 2 costs 0{np^) for a dense regression. The dominant 
cost in a given iteration comes from computing H^^^\ Given a QR decomposition of 
H we can compute that matrix product in 0{np) work as QiQ'^y^-'^)- The cost of the 
update could be even less if ^^^^ has fewer than p nonzero entries and one maintains 
the dense matrix H. 

To give (4.8) another explanation, suppose the spectral decomposition of the hat 
matrix H is given hy H = UDU^ . Define an index set c = {z : Da = 0} and let Uc 
be formed by taking the corresponding columns of U . Then a reduced model can be 
obtained from the mean shift outlier model (1.2) 

y = A^ + e\ e' ~ A/'(0, I (^n-p),<{n-p)) , (4.9) 

where y = Ujy and A = Uj G M("~p)^". The term Xf3 has disappeared from 
the reduced model. For a special Uc, the vector y has the BLUS residuals of Theil 
(1965). The regression model (4.9) is a p > n sparsity problem. It is also a wavelet ap- 
proximation problem that Antoniadis & Fan (2001) studied in the context of wavelet 
denoising, because A satisfies AA^ = I. Furthermore, the regularized one-step 
estimator (ROSE) (Antoniadis & Fan 2001) is the first step of G-IPOD. 

We can build a connection between the reduced model and simplified B-IPOD. She 
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(2009) proposed a class of thresholding-based iterative selection procedures (TISP) 
for model selection and shrinkage. B-TISP for solving the sparsity problem (4.9) is 
given by 

7(^+^) = eiA'y/kl + (/ - A'A/klW^^- X/kl) (4.10) 

with fco equal to the largest singular value of the Gram matrix A = H. The 
iteration (4.10) reduces exactly to (4.8). Therefore, all TISP studies apply to the 
0-lPOD algorithm. For example, the TISP convergence theorem can be used to 
establish a version of Theorem 4.1, and the nonasymptotic probability bounds for 
sparsity recovery reveal masking and swamping errors. In particular, the advocated 
hard-thresholding-like 6 in TISP corresponds to a redescending ip in our outlier iden- 
tification problem. 

The simplified procedure (4.8) is easy to implement and is computationally ef- 
ficient, because the iteration does not involve complicated operations like matrix 
inversion. Model (4.9) is simpler than the original (1.2) because (3 does not appear 
and all observations are clean. They have non-outlying errors because we have moved 
the outlier variables into the regression. Using this characterization of 7, it is not 
difficult to show that the IPOD-estimate f3 satisfies the regression, scale, and affine 
equivariant properties (Hampel et al. 1986) desirable for a good robust regression 
estimator: 

(i) ^(X, y + Xri)= ^(X, y) + 77, Wrj e W; 

(ii) ^(X,c2/)=c^(X,y), VcGM; 

(in) f5{XC,y) = C~^(3{X,y), for any nonsingular C 
We are making here a mild assumption that the initial robust estimate is equivariant, 
as LTS, S and PY (Pena & Yohai 1999) are, and we're ignoring the possible effect of 
convergence criteria on equivariance. 
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It remains to select A. The clean dataset (A, y) in (4.9) has uncorrelated noise 
contamination, so we can use it to tune the parameters via BIG and related methods; 
see Section 6 for details. 

5 e-IPOD vs. IRLS 

We consider some computational issues in this section. As Proposition 4.1 sug- 
gests, 6-IPOD solves an M-estimation problem. The standard fitting algorithm 
for M-estimates is the well-known iteratively re- weighted least squares (IRLS). Let 
w{t;X) = ■ip(t;X)/t, taking 0/0 = if necessary. The ^/^-equation (4.4) that defines 
an M-estimator can be rewritten as Yl^=i ^ i^i/'^'i ''^■i^i = 0, where ri = yi — xj f3. 
Accordingly, an M-estimate corresponds to a weighted LS estimate as is well known. 
These multiplicative weights can help downweight the bad observations. Iteratively 
updating the weights yields the IRLS algorithm, which is the most common method 
for computing M-estimates. Model (1.2) indicates that M-estimation can also be 
characterized through additive effects on all observations. 

We performed a simulation to compare the speed of 0-IPOD to that of IRLS. 
The observations were generated according to y = X(3 + 7 + e, where e ~ A/'(0, /). 
The predictor matrix X is constructed as follows: first let X = C/S^^^, where 
Uij ~ f/(— 15, 15) and = p^'^^ with p = 0.5; then modify the first O rows to be 
leverage points given by L- [1, ■ ■ ■ ,1]. We consider all nine cases with L G {15, 20, 30} 
and O G {5, 10,20}. Three more cases correspond to additive outliers at O points 

that were not leverage points (no rows of X changed). The shift vector is given by 
^ = ({8}0,{0}-0)T 

Given n and p, we report the total cost of computing all M-estimates for these 
different combinations of the number of outliers (O) and leverage value (L), each 
combination simulated 10 times. The scale parameter (A) decreased from ||(/ — 
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H)y./ sj diag(/ — i/) ||oo to 0.5 with fixed step size —0.1, where ./ stands for el- 
ementwise division. The upper limit is the largest possible standardized residual. 
Empirically, the lower bound 0.5 yields approximately half of 7^ nonzero. Also for 
A < 0.5 IRLS often encountered a singular WLS during the iteration, or took excep- 
tionally long time to converge, and so could not be compared to 0-IPOD. We used 
IRLS (with fixed a = 1, as an oracle would have) and simplified 0-IPOD. The com- 
mon convergence criterion was ||7*--'''~^'' — 7'-''^||oo < 10~^. We studied all sample sizes 
n G {30,50,100,200,300,400,500,600,700,800,900,1000} and we took p = n/lO. 
The CPU times (in seconds) are plotted against the sample size in Figure 3. 

The speed advantage of 0-IPOD over IRLS in this simulation tended to increase 
with n and p. At = 1000 the hard-IPOD algorithm required about 2 or 3 times 
(averaged over all lambda values) as many iterations as IRLS but was about 10 
times faster than IRLS, due to faster iterations. For small n, we saw little speed 
difference. As remarked above, IRLS was sometimes unstable, with singular WLS 
problems arising for redescending ^/;, when fewer than p weights were nonzero. Such 
cases cause only a small problem for 0-IPOD: the update H'y'^^^ cannot then take 
advantage of sparsity, but it is very stable. 

Although we attained impressive computational gain over the popular IRLS, we 
consider speed to be secondary compared to robustness (and the speed advantage of 
0-IPOD can be moot when the preliminary method is very expensive). 

6 Parameter Tuning in Outlier Detection 

The parameter A in an M-estimators (4.4) is often chosen to be a constant (for all 
n), based on either efficiency or breakdown of the estimator. The value 2.5(3" is popu- 
lar (Rousseeuw & Leroy 1987; Wilcox 2005; Maronna et al. 2006). But as mentioned 
in Section 3, even with no outhers and X = /, a constant A independent of n is 
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Figure 3: Computational time comparison between 0-IPOD and IRLS for problems of 
different sizes. The IRLS computational times are represented by red squares, and the 6- 
IPOD computational times by green circles. The plots, from left to right, top to bottom, 
correspond to the examples in Figure 2 respectively. 

far from optimal (Donoho & Johnstone 1994). It is also hard, in robust regressions, 
to select the cutoff value rj at which to identify outliers, because the residual distri- 
bution is usually unknown. Gervini & Yohai (2002) base an asymptotically efficient 
choice for 77 on a Kolmogorov-Smirnov statistic, but they need to assume that the 
standardized robust residuals are IID A/'(0, 1). 

For 0-IPOD, these two issues correspond to the problem of tuning A. In most M- 
estimators of practical interest, the function satisfies il){t/a\\) oc il){t\\a). Then 
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we may set cr = 1 and carefully tune A. Even with this simplification, a data- 
dependent A is difficult to choose using cross-validation on the (X, y) data. The 
reason is that a large prediction error can indicate either a suboptimal f3 or an outlier. 
Fortunately, turning to the reduced model (4.9) greatly mitigates the problem because 
all (transformed) observations are clean. Then BIG can be applied to {A, y) if the 
proportion of outliers is not large. 

Specifically, because all candidate estimates lie along the 9-IPOD solution path, 
we design a local BIG to apply BIG on a proper local interval of the degrees of 
freedom (DF). First we generate the hard-IPOD solution path by decreasing A from 
||(/ — H)y./ ^Jd,ia,g{I — i3") ||oo to 0. Given A and the corresponding estimate 7(A), 
let nz{\) = {i : 7j(A) 7^ 0}. We rely on model (4.9) and study its variable selection 
to give the correct form of BIG. For hard-IPOD, is an OLS estimate with one 
parameter per detected outlier and the degrees of freedom are given by DF(A) = 
|n2;(A)|. We use BIG with a slight modification: 

BIG*(A) = m log(RSS/m) + A;(log(m) + 1), (6.1) 

where m = n - p, RSS = \\if - Aj\\l = - H){y - and k{X) = DF(A) + 1. 
We have taken k = DF + 1 to account for the noise scale parameter, and BIG* uses 
log(m) + 1 instead of BIG's log(m) because we have found this change to be better 
empirically. Note that the sample size m here is smaller than the dimension of 7. 
According to Ghen & Ghen (2008), similar modifications are necessary to preserve 
the model selection properties of BIG for problems with fewer observations than 
predictors. We would like to apply BIG* on a proper local interval of DF given by 
[ul, i^u]- We take uu < n/2, assuming that the proportion of outliers is under 50%. 
The (DF(A), BIG*(A)) curve sometimes has narrow local minima near the ends of the 
A range. To counter that effect we fit a smoothing spline to the set of data points 
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(DF(A), BIC*(A)) and chose the local minimum with the largest neighborhood. The 
neighborhood size can be determined using the local maxima of the smoothing spline. 
Of course there may be other reasonable ways to counter that problem. 

We carried out simulation experiments to test the performance of the tuned 0- 
IPOD. The matrix X was generated the same way as in Section 5 using dimension 
p G {15,50}, n = 1000 observations of which the first O G {200,150,100,50,10} 
were outliers at the highly leveraged location given by L G {15,20} times a vector 
of Is. The outliers were generated by a mean shift 7 = ({5}*^, {0}""'^)"'" added to y. 
Because 0-IPOD is affine, regression, and scale equivariant and the outliers are at a 
single X value, we may set /3 = without loss of generality. The intercept term is 
always included in the modeling. 

Five outlier detection methods were considered for comparison: hard-IPOD (tuned), 
MM-estimator, Gervini & Yohai's (2002) fully efficient one-step procedure (denoted 
by GY), the compound estimator SIS, and the LTS. (The direct procedures, such 
as Hadi & Simonoff (1997), Pena & Yohai (1995), behaved poorly and their results 
were not reported.) The S-PLUS Robust library provides implementations of MM, 
GY, and LTS; for the implementation of SIS, we refer to Wilcox (2005). Robust 
also provides a default initial estimate with high breakdown point (not necessarily 
efficient), which is used in the first three algorithms in our experiments: when p = 15, 
it is the S-estimate computed via random resampling; when p = 50, it is the estimate 
from the fast PY procedure. Since it is well known that the initial estimate is outper- 
formed by MM in terms of estimation efficiency and robustness, both theoretically 
and empirically (Yohai 1987; Salibian-Barrera & Yohai 2006; Pena & Yohai 1999), 
its results are not reported. (All of these routines are available for the R language (R 
Development Core Team 2007) as well. The package robust provides an R version of 
the Insightful Robust library.) 
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All methods apart from 0-IPOD require a cutoff value to identify which residuals 
are outliers. We applied the fully efficient procedure which performs at least as well 
as the fixed choice of 77 = 2.5 in various situations (Gervini & Yohai 2002). 

Each model was simulated 100 times. We report outlier identification results for 

each algorithm using three benchmark measures: 
M the mean masking probability (fraction of undetected true outliers), 

S the mean swamping probability (fraction of good points labeled as outliers), 

JD the joint outlier detection rate (fraction of simulations with masking). 

In outlier detection, masking is more serious than swamping. The former can 
cause gross distortions while the latter is often just a matter of lost efficiency. 

Ideally, M ^ 0, S ~ 0, and JD ^ 100%. JD is the most important measure on 
easier problems while M makes the most sense for hard problems. The simulation 
results are summarized in Tables 1 and 2. Figures 4 and 5 present M and JD for 
p = 15 and 50 respectively. While our main purpose is to identify outliers, a robust 
coefficient estimate /3 can be easily obtained from 6-IPOD. The MSE in f3 for p = 15 
is shown in Figure 6. All methods had small slope errors, though 6-IPOD and GY 
performed best. The results for p = 50 (not shown) are similar. 

MM and GY are two standard methods provided by the S-PLUS Robust library. 
Nevertheless, as seen from the Tables 1 and 2, the MM-estimator, though perhaps 
most popular in robust analysis, does not yield good identification results when the 
outliers have high leverage values and the number of outliers is not small (for example, 
O = 200, L = 15,20). GY improves MM a lot in this situation and gives similar 
results otherwise. The experiments also show that SIS behaves poorly in outlier 
detection and LTS works better in the presence of < 5% outliers. Unfortunately, all 
four methods have high masking probabilities and very low joint identification rates, 
which become worse for large p. The 0-IPOD method dominates them significantly 
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Figure 4: Masking (M) and joint detection (JD) results when p = 15. 
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Figure 5: Masking (M) and joint detection (JD) when -p = 50. 
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Table 1: Outlier identification results on simulated data with p = 15. Five methods are 
compared: hard-IPOD (tuned), MM-estimator, Gervini & Yohai's fully efficient one-step 
procedure, SIS, and LTS. Leverage L and number of outliers O are described in the text as 
are the quality measures JD (joint detection), M (masking) and S (swamping). 

for masking and joint detection. 

To judge statistical significance of these MC results we constructed paired t statis- 
tics based on our 100 replicates. The numerator in each was the number of outliers 
missed by a competing method minus the number missed by G-IPOD; the denomina- 
tor was the standard error of the numerator. Most of the t-statistics were larger than 
3 and grew rapidly with O. LTS versus 0-IPOD had the smallest t-statistic (L2 for 
O = 15) but also the largest (~ 1700 for O = 200). While O-IPOD does much better 
on masking, it is slightly worse for swamping. This is an acceptable tradeoff because 
masking causes far more harm. 
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Table 2: Outlier identification results on the simulation data where p = 50. Methods, 
conditions and results are the same as in Table 1. 



7 Outlier Detection with p > n 



Here we extend outlier detection to problems with p > n including of course some 
high-dimensional problems. This context is more challenging but has diverse modern 
applications in signal processing, bioinformatics, finance, and neuroscience among 
others. Performing the task of outlier identification for data with p > n ot even 
p ^ n goes beyond the traditional robust analysis which requires a large number of 
observations relative to the dimensionality. 

The large-p version of G-IPOD is indeed possible under our general framework 
(1.2). li (3 eMP is also assumed to be sparse in the mean shift outlier model, one can 

/3 



directly work in an augmented data space y ~ with B = [X I] and ^ 
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to study this large-p sparsity problem. Concretely, the TISP technique can be used 
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Figure 6: Coefficient estimation errors when p = 15. 
to derive the O-IPOD iteration = Q{B'^y/k^ + (I - B'^ B /k^)^^^^; A/fcg), or 



and 



(7.1) 



The convergence of (7.1) is guaranteed even for n < p (She 2009), as long as [X /] is 
scaled properly, which here amounts to taking ko > a/||X||| + 1. Let ^ and 7 denote 
the estimates from (7.1). The nonzero components of f3 locate the relevant predictors 
while 7 identifies the outliers and measures their outlyingness. (We could also apply 
(7.1) when p < n for simultaneous variable selection and outlier detection.) Not 
surprisingly, Li (or soft-thresholding) fails again for this challenging sparsity problem 
and thresholdings corresponding to redescending ip^s should be used in (7.1). It is 
natural to consider the elastic net (Zou & Hastie 2005) for this problem as well. But 
the elastic net has a convex criterion, and it can be shown to have breakdown zero. 
Since it is usually unknown how severe the outliers are, one may adopt the fol- 
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lowing hybrid of hard-thresholding and ridge-thresholding 

0, if < A 
e(x; A, //)=<( (7.2) 

referred to as the hybrid-thresholding in She (2009), or hard-ridge thresholding in 
this paper. The corresponding penalty from the three-step construction is 



Pe{x; X,r]) 



-ix^ + Alxl, if Ixl < 
2 'II' II i+n 



Setting q{x; A,?]) = ^^(|x| — A)^lo<|a;|<A, we get an alternative penalty 

1 A^ 1 
P{x;\,r]) = -^^1x^0 + -Vx'^- (7.3) 

From (7.3), we see that hard-ridge thresholding successfully fuses the Lo-penalty and 
the L2-penalty (ridge-penalty) via 6-thresholding. The Lq portion induces sparsity, 
while the L2 portion shrinks f3. The difference from hard thresholding is as follows. 
When hard thresholding makes 7j 7^ we get % = rj, removing any influence of 
observation i on /3. With hard-ridge thresholding, the influence of the observation 
can be removed partially, with the extent of removal controlled by 77. This is helpful 
for nonzero but small |7i| corresponding to mild outlyingness. In addition, the L2 
shrinkage also plays a role in estimation and prediction. 

The large p setting brings a computational challenge. Because the large-p sparse 
G-IPOD can be very slow, we used the following proportional 0-IPOD to screen out 
some 'nuisance dimensions' with coefficients being exactly zero (due to our sparsity 
assumption). Concretely, at each update in (7.1), A was chosen to get precisely 
an nonzero components in the new (jfl^"'^^^, 7*^-^"'"^^). We used a = 0.75 though other 
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choices could be made. Proportional 6-IPOD yields at most an candidates predictors 
to have nonzero coefficients, making the problem simpler. Next, we run (7.1) with just 
those predictors, getting a solution path in A and choosing A by BIG*. In proportional 
G-IPOD, a variable that gets killed at one iteration may reappear in the fit at a later 
iteration. This is quite different from independent screenings based on marginal 
statistics like FDR (Benjamini & Hochberg 1995) or SIS (Fan & Lv 2008). 

For our example we used the proportional hard-ridge IPOD. Specifically, for rj in 
a small grid of values, we ran proportional B-IPOD using the hard-ridge thresholding 
function 0. We chose rj by BIG*. 

We perform the joint robust variable selection and outlier detection on the sugar 
data of Brown et al. (1998). The data set concerns NIR spectroscopy of compositions 
of three sugars in aqueous solution. We concentrate on glucose (sugar 2) as the 
response variable. The predictors are second derivative spectra of 700 absorbances 
at frequencies corresponding to wavelengths of llOOnm to 2498nm in steps of 2nm. 
There are 125 samples for model training. A test set with 21 test samples is available. 
These 21 samples were specially designed to be difficult to predict, with compositions 
outside the range of the other 125. See Brown (1993) for details. 

Brown et al. (1998) used an MGMG Bayesian approach for variable selection, but 
only 160 equally spaced wavelengths were analyzed due to the computational cost. We 
used all 700 wavelengths in fitting our sparse robust model that also estimates outliers. 
The problem size is 125 x (700 + 125). After the screening via proportional hard-ridge- 
IPOD, we ran hard-ridge-IPOD and tuned it as follows. First we set A = which turns 
7] into an ordinary ridge parameter. We fit that ridge parameter i], obtaining rj* by 
minimizing BIG* of Section 6. Then for each rj in the grid {0.5?7*, 0.05?7*, 0.005?7*}, we 
found A(?7) to minimize BIG*, and finally chose among the three (A(?7), 77) combinations 
to minimize BIG*. We think there is no reason to use 77 > 77* and a richer grid than 
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the one we used would be reasonable, but we chose a small grid for computational 
reasons. Our experience is that even small t] values improve prediction. 
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Figure 7: Sparse coefficients and outlier shifts on the sugar data. 



Figure 7 plots the final estimates f3 and 7. From the left panel, we see that the 
estimated model is sparse. It selects only 15 of the wavelengths. The estimate 7 
shown in right side of Figure 7 suggests that observation 99 might be an outlier. 
We found 799 = 1.7706 and r99 = i/Qg — xjg^ = 1.7711, which indicates that this 
observation is (almost) unused in the model fitting. Our model has good prediction 
performance; the mean-squared error is 0.219 on the test data, improving the reported 
MCMC results by about 39%. The robust residual plot is shown in Figure 8. 



8 Discussion 

The main contribution of this paper is to consider the class of B-estimators (She 2009), 
under the mean shift outlier model assumption, defined by the fixed point 7 = 
0(i?7 + (/ — H)y; A) which can be directly computed by 0-IPOD. With a good 
design of 6, we successfully identified the outliers as well as estimating the coefficients 
robustly. This technique is associated with M-estimators, but gives a new characteri- 
zation in form of penalized regressions. Furthermore, we successfully generalized this 
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Figure 8: Residual plot from hard-ridge-IPOD on the sugar data. The apparent outlier, 
observation 99, is shown in red star. 

penalized/thresholding methodology to high- dimensional problems to accommodate 
and identify gross outliers in variable selection and coefficient estimation. 

When outliers are also leverage points, the Gram matrix of the reduced model 
(4.9), i.e., A^A = I - H, may demonstrate high correlation between clean obser- 
vations and outliers for small hi. It is well known from recent advances of the lasso 
(e.g. the ir represent able conditions (Zhao & Yu 2006) and the sparse Riesz condi- 
tion (Zhang & Huang 2008)) that the convex Li-penalty encounters great trouble 
in this situation. Rather, nonconvex penalties which correspond to redescending ip^s 
must be applied, with a high breakdown point initial estimate obtained by, say, the 
fast-LTS (Rousseeuw & Van Driessen 2002) or fast-S (Salibian-Barrera & Yohai 2006) 
when p is small, or the fast PY procedure (Pefia & Yohai 1999) when p is large. 

In this framework, determining the efficiency parameter in M-estimators and 
choosing a cutoff value for outlier identification are both accomplished by tuning 
the choice of A. Our experience shows that adopting an appropriate data-dependent 
choice of A is crucial to guarantee good detection performance. 

We close with a note of caution on an important issue for the outlier detection 
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literature. We have found that our robust regression algorithms work better than 
competitors on our simulated data sets and that they give the right answers on some 
small well studied real data sets. But our method and the others we study all rely on 
a preliminary robust fit. The most robust preliminary fits have a cost that grows ex- 
ponentially with dimension and for them p = 15 is already large. In high dimensional 
problems, the preliminary fit of choice is seemingly the PY procedure. It performed 
well in our examples, but it does not necessarily have high breakdown. Should the 
preliminary method fail, foUowup methods like 6-IPOD or the others, may or may 
not correct it. We have seen hard-IPOD work well from a non-robust start (e.g. the 
HBK problem) but it would not be reasonable to expect this will always hold. We 
expect that 0-IPOD iterations will benefit from improvements in preliminary robust 
fitting methods for high dimensional problems. For low dimensional problems robust 
preliminary methods like LTS are fast enough. 



A Proofs 



• Proof of Proposition 3.1 

Since both Ruber's method and the Li-penalized regression are convex, it is sufficient 
to consider the KKT equations. In this proof, we define 



e(t;A) = < 



t- \, if t > A 

0, if |t| < A . ^(^;^) 

-t + A, if t < -A 



A, if t > A 



t, if \t\ < A 



-A, ift<-A 



Obviously, 6(t; X) + ip(t] A) = t. For simplicity, we use the same symbols for the vector 
versions of the two functions: Q{t] A) = [Q{ti, X)]"^^, ipit; A) = [tpiU, A)]^=i, Vt G M". 
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The KKT equations for Ruber's estimate (/3, a) are given by 



X-Jy^-X] = (A,l) 



a 

MtA^--^)"]^- ^ (A.2) 



Let r = y — XjS, G = {i : \fi\ < Aa}, and O = {i : \fi\ > Xa}. Since 

PI -tVa^, if Itl < Xa 

— (2p(t/a;A)a) = 

I-A2, ii\t\>Xa 
(A.2) becomes nc = A^|0| + J^ied^i/^'^- summarize, a) satisfies 

= X> and (A.3) 

= \\rG\\l/icn~X'\d\). (A.4) 

Next, the joint KKT equations for Li-penahzed regression estimates 7, a) are 



7 = 0(7/ — Xf3; Xa) and 



1 



0= ||z/-X/3-7|| ■(--)+ nc. 



from which it follows that X'^i/j(^{y — X(3)/d'; A) = 0, and o"^ = \\r — 'jW^/nc. But 
11^ ~ 7II2 = Z^ieG^*^ + A^(T^|6|, so we obtain 



= ^'^^ ( ^-^^;A 1 , and 



a' = \\r^\\l/{cn-X'\0\), 
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which are exactly the same as (A. 3) and (A. 4). Therefore, (3.3) leads to Ruber's 
method with joint scale estimation. When a is fixed, it is not difficult to show that 
the two minimizations still yield the same /3-estimate. 

Finally, Huber notices his method behaves poorly even for moderate leverage 
points and considers an improvement of using Ciipifi/ {c2<j)) to replace iplvi/a) (Huber 
1981). He claims, based on heuristic arguments, that ci = C2 = \J\ — hi is a good 
choice. This is quite natural as seen from our new characterization. The regression 
matrix for 7 in (3.3) is actually I — H with the column norms given by \/l — hi. The 
weighted Li-penalty of ^(A-\/l — hi) x |7j| then leads to Huber's improvement, due 
to the fact that Kipit / K\\) = il){t\KX) for K > {). (However it cannot completely 
avoid masking and swamping unless a redescending ip is used.) □ 

• Proof of Proposition 4.1 

By definition, for any G-IPOD estimate (/3,'y), 7 is a fixed point of 7 = Q{H^ + 
(/ - H)y; A), and ^ = {X^Xy^X'^iy - 7). It follows that 

X'^^iy - XP) = X''ij{y - H{y - 7); A) 

= X\iil - H)y + ff7) - ©((/ - H)y + H^)- A)) 
= X''{{I-H)y + H^-^) 
= X"^(/-ff)-(2/-7) = 0, 

and so (3 is an M-estimate associated with ip. □ 

• Proof of Theorem 4.1 

The second inequality in (4.3) is straightforward from the algorithm design. To show 
the first inequality is true, it is sufficient to prove the following lemma. 

Lemma A.l. Given a thresholding rule 6, let P he any function satisfying P{6; A) = 
P(0; A) + Pe{9] A) + q{9; A) where g(-; A) is nonnegative and q{Q{6; A)) = for all 9. 
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Then, the minimization problem mm0{t — 9)"^ / 2+ P [9] A) has a unique optimal solution 



This is a generalization of Proposition 3.2 in Antoniadis (2007). Note that P (and 
Pe) may not be differentiable at and may not be convex. 

Without loss of generality, suppose t > 0. It suffices to consider 6* > since 



f{0) > f{-9),\/9 > 0, where f{9) = {t - 9f /2 + P{9; X). First, given by (4.f) is 



Suppose 9 > 0(t;A). By definition Q-\9; X) > t, and thus f{9) > /(e(t;A)). 
There must exist some u E [Q{t;X),9) s.t. Q-^{u;X)>t. Otherwise we would have 
6(t'; X) > 9 > 6(t; A) for any t' > t, and thus G(-) would be discontinuous at t. 
Noticing that e~^(-) is monotone, /e(t.A)(0"Hw; A) -t)du> 0, or f{9) > f{<d{t; A)). 
A similar reasoning applies to the case 9 < Q(t; X). The proof is now complete. □ 
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